R has many contributed to solve specific needs and problems in several disciplines. These are usually hosted in repositories, such as CRAN (https://cran.r-project.org/), Bioconductor (https://www.bioconductor.org/), R-Forge (https://r-forge.r-project.org/), ROpenSci (https://ropensci.org/), Stan packages (http://mc-stan.org/r-packages/)… Yet other packages are hosted in version control websites as GitHub or GitLab.
In all cases, to make use of a package, it has to be installed and loaded in memory.
installed.packages()[,1] # Lists the installed packages
(.packages()) # Lists the packages currently loadedSome relevant resources to find useful packages are https://www.rdocumentation.org/ and https://cran.rstudio.com/web/views/.
To install a package, we use
To load a package in memory, we use
Alternatively, RStudio offers a menu-based package management feature.
In a script, to avoid repeated installations of a package and to ensure its availability, we can use
If the package is installed but not loaded, we can still access its functions by using the ::operator.
To browse the help files for a packages, run
Additionnaly some packages have a vignette and some demo code that can be inspected with the vignette() and demo() functions.
Last, to learn more about packages and package development, you may want to read https://r-pkgs.org/index.html.
It is often useful to save and load data for its transfer among different application or just to keep a copy of it for back-up or later use.
Although R can read and write many different types of files, we will focus on the most relevant format for statistics, chemistry and chemometrics. We will discuss here how to handle
Before working with external files, it is important to set the working directory. setwd and getwd functions can be used to set the working directory from the R code, although its use is not recommended in scripts that may need to be shared.
RStudio allows setting the working directory from the Session menu and from the Files pane.
Let’s start from the wine data set in the FactoMineR package.
## 'data.frame': 21 obs. of 31 variables:
## $ Label : Factor w/ 3 levels "Saumur","Bourgueuil",..: 1 1 2 3 1 2 2 1 3 1 ...
## $ Soil : Factor w/ 4 levels "Reference","Env1",..: 2 2 2 3 1 1 1 2 2 3 ...
## $ Odor.Intensity.before.shaking: num 3.07 2.96 2.86 2.81 3.61 ...
## $ Aroma.quality.before.shaking : num 3 2.82 2.93 2.59 3.43 ...
## $ Fruity.before.shaking : num 2.71 2.38 2.56 2.42 3.15 ...
## $ Flower.before.shaking : num 2.28 2.28 1.96 1.91 2.15 ...
## $ Spice.before.shaking : num 1.96 1.68 2.08 2.16 2.04 ...
## $ Visual.intensity : num 4.32 3.22 3.54 2.89 4.39 ...
## $ Nuance : num 4 3 3.39 2.79 4.04 ...
## $ Surface.feeling : num 3.27 2.81 3 2.54 3.38 ...
## $ Odor.Intensity : num 3.41 3.37 3.25 3.16 3.54 ...
## $ Quality.of.odour : num 3.31 3 2.93 2.88 3.36 ...
## $ Fruity : num 2.88 2.56 2.77 2.39 3.16 ...
## $ Flower : num 2.32 2.44 2.19 2.08 2.23 ...
## $ Spice : num 1.84 1.74 2.25 2.17 2.15 ...
## $ Plante : num 2 2 1.75 2.3 1.76 ...
## $ Phenolic : num 1.65 1.38 1.25 1.48 1.6 ...
## $ Aroma.intensity : num 3.26 2.96 3.08 2.54 3.62 ...
## $ Aroma.persistency : num 2.96 2.81 2.8 2.58 3.3 ...
## $ Aroma.quality : num 3.2 2.93 3.08 2.48 3.46 ...
## $ Attack.intensity : num 2.96 3.04 3.22 2.7 3.46 ...
## $ Acidity : num 2.11 2.11 2.18 3.18 2.57 ...
## $ Astringency : num 2.43 2.18 2.25 2.18 2.54 ...
## $ Alcohol : num 2.5 2.65 2.64 2.5 2.79 ...
## $ Balance : num 3.25 2.93 3.32 2.33 3.46 ...
## $ Smooth : num 2.73 2.5 2.68 1.68 3.04 ...
## $ Bitterness : num 1.93 1.93 2 1.96 2.07 ...
## $ Intensity : num 2.86 2.89 3.07 2.46 3.64 ...
## $ Harmony : num 3.14 2.96 3.14 2.04 3.64 ...
## $ Overall.quality : num 3.39 3.21 3.54 2.46 3.74 ...
## $ Typical : num 3.25 3.04 3.18 2.25 3.44 ...
To write a delimited text file from a data frame, we use the write.table function or any of its derivative functions. To read, read.table is to be used.
write.table(wine, file="wine.csv", sep=",", dec=".",
quote=TRUE, fileEncoding="UTF-8", row.names=FALSE)In case we want a tab-delimited text file, we will use \t for the argument sep.
Other options exist for managing text-files.
In case the format is not well-defined, we may want to use readLines and writeLines to read or write the file line by line.
If the file is large, read_table from the readr package or vroom from the vroom package are usually faster.
An example of an XML file containing the structure of caffeine can be found at https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/CID/2519/record/XML. It can be read in R with read_xml from the xml2 package.
Information can be extracted using XPath selectors, https://www.w3schools.com/xml/xpath_intro.asp.
xmlObj <- read_xml(
"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/CID/2519/record/XML",
options = "RECOVER")
xml_ns_strip(xmlObj) # Important!
xmlObj_Name <- xml_find_all(xmlObj,xpath=
".//PC-InfoData//*[text()='Traditional']/../../..//PC-InfoData_value_sval/text()")
as.character(xmlObj_Name)## [1] "caffeine"
NMR spectra can be obtained in CML (Chemistry Markup Language, http://www.xml-cml.org/spec/) format from https://www.nmrshiftdb.org.
For example, the 13C-NMR spectra for butane is available at https://www.nmrshiftdb.org/NmrshiftdbServlet/nmrshiftdbaction/searchorpredict/smiles/CCCC/spectrumtype/13C.
JSON files are another common format for data storage and transmission. In R, the jsonlite package offers functions to import and convert this files to list or data frames.
jsBz <- fromJSON(
"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/241/JSON")
jsBz_NI <- jsBz$Record$Section[
jsBz$Record$Section$TOCHeading=="Names and Identifiers",]
jsBz_NI_OI <- jsBz_NI$Section[[1]][
jsBz_NI$Section[[1]]$TOCHeading=="Other Identifiers",]
jsBz_NI_OI$Section[[1]]$TOCHeading## [1] "CAS" "Related CAS"
## [3] "Deprecated CAS" "European Community (EC) Number"
## [5] "ICSC Number" "NSC Number"
## [7] "RTECS Number" "UN Number"
## [9] "UNII" "DSSTox Substance ID"
## [11] "Wikipedia" "NCI Thesaurus Code"
CAS <- jsBz_NI_OI$Section[[1]][jsBz_NI_OI$Section[[1]]$TOCHeading=="CAS",]
table(unlist(CAS$Information[[1]]$Value$StringWithMarkup))##
## 26181-88-4 71-43-2
## 1 13
Let’s use the wine data set again.
## 'data.frame': 21 obs. of 31 variables:
## $ Label : Factor w/ 3 levels "Saumur","Bourgueuil",..: 1 1 2 3 1 2 2 1 3 1 ...
## $ Soil : Factor w/ 4 levels "Reference","Env1",..: 2 2 2 3 1 1 1 2 2 3 ...
## $ Odor.Intensity.before.shaking: num 3.07 2.96 2.86 2.81 3.61 ...
## $ Aroma.quality.before.shaking : num 3 2.82 2.93 2.59 3.43 ...
## $ Fruity.before.shaking : num 2.71 2.38 2.56 2.42 3.15 ...
## $ Flower.before.shaking : num 2.28 2.28 1.96 1.91 2.15 ...
## $ Spice.before.shaking : num 1.96 1.68 2.08 2.16 2.04 ...
## $ Visual.intensity : num 4.32 3.22 3.54 2.89 4.39 ...
## $ Nuance : num 4 3 3.39 2.79 4.04 ...
## $ Surface.feeling : num 3.27 2.81 3 2.54 3.38 ...
## $ Odor.Intensity : num 3.41 3.37 3.25 3.16 3.54 ...
## $ Quality.of.odour : num 3.31 3 2.93 2.88 3.36 ...
## $ Fruity : num 2.88 2.56 2.77 2.39 3.16 ...
## $ Flower : num 2.32 2.44 2.19 2.08 2.23 ...
## $ Spice : num 1.84 1.74 2.25 2.17 2.15 ...
## $ Plante : num 2 2 1.75 2.3 1.76 ...
## $ Phenolic : num 1.65 1.38 1.25 1.48 1.6 ...
## $ Aroma.intensity : num 3.26 2.96 3.08 2.54 3.62 ...
## $ Aroma.persistency : num 2.96 2.81 2.8 2.58 3.3 ...
## $ Aroma.quality : num 3.2 2.93 3.08 2.48 3.46 ...
## $ Attack.intensity : num 2.96 3.04 3.22 2.7 3.46 ...
## $ Acidity : num 2.11 2.11 2.18 3.18 2.57 ...
## $ Astringency : num 2.43 2.18 2.25 2.18 2.54 ...
## $ Alcohol : num 2.5 2.65 2.64 2.5 2.79 ...
## $ Balance : num 3.25 2.93 3.32 2.33 3.46 ...
## $ Smooth : num 2.73 2.5 2.68 1.68 3.04 ...
## $ Bitterness : num 1.93 1.93 2 1.96 2.07 ...
## $ Intensity : num 2.86 2.89 3.07 2.46 3.64 ...
## $ Harmony : num 3.14 2.96 3.14 2.04 3.64 ...
## $ Overall.quality : num 3.39 3.21 3.54 2.46 3.74 ...
## $ Typical : num 3.25 3.04 3.18 2.25 3.44 ...
To write a data.frame to Excel, we can use the write_xlsx from the writexl package. To read an Excel file, read_xlsx from the readxl package can be used.
To open and write Matlab data files, we can use the R.matlab package. Relevant functions are called readMat and writeMat.
writeMat("wine.mat", wine = wine)
wineMat <-readMat("wine.mat")
wineMat <- data.frame(as.character(unlist(wineMat[[1]][[1]])),
as.character(unlist(wineMat[[1]][2])),
as.data.frame(wineMat[[1]][3:31]))Reading a Matlab data file returns a list that will need ad hoc manipulation.
To save and read data in the R data format, we use save and load. These allow storing and restoring any set of variables of the environment. When loaded, variables are recovered with the same names they had on saving.
Besides the general data file formats already discussed, there are several formats used specifically to store chemical information. Some significant ones are
More information on chemical data files can be found at https://en.wikipedia.org/wiki/Chemical_file_format, https://en.wikipedia.org/wiki/Mass_spectrometry_data_format, http://unichrom.com/chrom/uc-ffe.shtml or https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3518119/.
Chemical Table Files are a set of text-based file types designed to store molecular information, e.g. positions of the atoms and connection tables. Some of this format allow storing additional information such as conformers, additional properties and identifiers, or spectra.
Currently the simplest way to read a MOL or SDF file in R is the read.SDFset function in the ChemmineR Bioconductor package.
## $CMP1
## Molecule_Name
## "C4H10"
## Source
## "APtclcactv01172209173D 0 0.00000 0.00000"
## Comment
## " "
## Counts_Line
## " 14 13 0 0 0 0 0 0 0 0999 V2000"
## CMP
## 58.1222
JCAMP-DX is an text-based open standard to store and distribute spectral data. Currently the best way to read these files in R is the readJDX function in the readJDX package.
jdx1 <- readJDX ("../data/67-64-1-IR.jdx")
plot(jdx1$Acetone$x,jdx1$Acetone$y, type="l",
xlab="wave number", ylab="T")JCAMP-DX downloaded from https://webbook.nist.gov/cgi/cbook.cgi?ID=C67641, in NIST Chemistry WebBook, NIST Standard Reference Database Number 69, Eds. P.J. Linstrom and W.G. Mallard, National Institute of Standards and Technology, Gaithersburg MD, 20899, https://doi.org/10.18434/T4D303, (on January 5, 2022).
MSP is a text-based NIST-promoted format to store collections of spectra. It’s a common format in the metabolomics community.
Downloadable spectral databases can be found at
While a functional package to read MSP files doesn’t seem to be available, these files can be read with standard functions for text files.
msp1 <- readLines("../data/MSMS-Neg-MassBankEU.msp")
msp1 <- paste(msp1, collapse="\n")
msp1 <- unlist(strsplit(msp1, "\n\n"))
msp1_1 <- unlist(strsplit(msp1[1],"\n"))
spectrum <- msp1_1[1:length(msp1_1) > which(substr(msp1_1,1,10)=="Num Peaks:")]
spectrum <- read.table(text=spectrum, sep="\t")
colnames(spectrum) <- c("m_z","int")
spectrum## m_z int
## 1 96.9602 70
## 2 241.0031 1000
::: {.bibref} MSMS-Neg-MassBankEU.msp downloaded from http://prime.psc.riken.jp/compms/msdial/main.html#MSP
As already presented, we usually have to arrange data we have in a data frame by
Many of these operations can be performed by using a set of functions that operate in a single common paradigm, sometimes defined as grammar for data manipulation, implemented in the dplyr and tidyr R packages.
We usually install and load both packages jointly with some other packages that work nicely together. These are grouped in the tidyverse package.
R supports a functional approach to programming where functions can be defined and transferred as arguments. Many calculations in R are integrated as functions. Functions are a critical concept in a tidy data management.
Functions are defined with the function reserved word (or \ since R 4.1.0).
## [1] 16
## [1] 81
%>% and |>Pipes are a syntax variant for transferring arguments to functions. It allows for a more process oriented organization of the code. R includes two types of pipes: %>% from the magrittr package and |> included in R 4.1.0.
## [1] 16
## [1] 16
## [1] 16
## [1] 4096
## [1] 4096
By default, the left operand of the pipe is sent as the first argument of the function on the right. It can be sent to a different position in the calculation by including a dot (.) in the right-hand side of the pipe.
## [1] 1.558329
## [1] 1.558329
dplyr and tidyrdplyr and tidyr are two packages of the tidyverse designed for manipulating and tidying data. A full description of both packages can be found at https://dplyr.tidyverse.org/ and https://tidyr.tidyverse.org/.
Current cheatsheets for both packages can be found at https://www.rstudio.com/resources/cheatsheets/.
Data manipulation in these packages is commonly done using a limited set of functions. The most important are
select and filter, for subsetting,mutate, for creating or updating calculated columns,arrange, for sorting,pivot_wider and pivot_longer, for reshaping, andgroup by and summarize, for aggregating.Let’s see how we operate in this logic, using as an example the wine data set.
## 'data.frame': 21 obs. of 31 variables:
## $ Label : Factor w/ 3 levels "Saumur","Bourgueuil",..: 1 1 2 3 1 2 2 1 3 1 ...
## $ Soil : Factor w/ 4 levels "Reference","Env1",..: 2 2 2 3 1 1 1 2 2 3 ...
## $ Odor.Intensity.before.shaking: num 3.07 2.96 2.86 2.81 3.61 ...
## $ Aroma.quality.before.shaking : num 3 2.82 2.93 2.59 3.43 ...
## $ Fruity.before.shaking : num 2.71 2.38 2.56 2.42 3.15 ...
## $ Flower.before.shaking : num 2.28 2.28 1.96 1.91 2.15 ...
## $ Spice.before.shaking : num 1.96 1.68 2.08 2.16 2.04 ...
## $ Visual.intensity : num 4.32 3.22 3.54 2.89 4.39 ...
## $ Nuance : num 4 3 3.39 2.79 4.04 ...
## $ Surface.feeling : num 3.27 2.81 3 2.54 3.38 ...
## $ Odor.Intensity : num 3.41 3.37 3.25 3.16 3.54 ...
## $ Quality.of.odour : num 3.31 3 2.93 2.88 3.36 ...
## $ Fruity : num 2.88 2.56 2.77 2.39 3.16 ...
## $ Flower : num 2.32 2.44 2.19 2.08 2.23 ...
## $ Spice : num 1.84 1.74 2.25 2.17 2.15 ...
## $ Plante : num 2 2 1.75 2.3 1.76 ...
## $ Phenolic : num 1.65 1.38 1.25 1.48 1.6 ...
## $ Aroma.intensity : num 3.26 2.96 3.08 2.54 3.62 ...
## $ Aroma.persistency : num 2.96 2.81 2.8 2.58 3.3 ...
## $ Aroma.quality : num 3.2 2.93 3.08 2.48 3.46 ...
## $ Attack.intensity : num 2.96 3.04 3.22 2.7 3.46 ...
## $ Acidity : num 2.11 2.11 2.18 3.18 2.57 ...
## $ Astringency : num 2.43 2.18 2.25 2.18 2.54 ...
## $ Alcohol : num 2.5 2.65 2.64 2.5 2.79 ...
## $ Balance : num 3.25 2.93 3.32 2.33 3.46 ...
## $ Smooth : num 2.73 2.5 2.68 1.68 3.04 ...
## $ Bitterness : num 1.93 1.93 2 1.96 2.07 ...
## $ Intensity : num 2.86 2.89 3.07 2.46 3.64 ...
## $ Harmony : num 3.14 2.96 3.14 2.04 3.64 ...
## $ Overall.quality : num 3.39 3.21 3.54 2.46 3.74 ...
## $ Typical : num 3.25 3.04 3.18 2.25 3.44 ...
newColnames <- c("label", "soil", "odorBf", "aromaBf",
"fruityBf", "flowerBf", "spiceBf",
"visual", "nuance", "surface", "odorI",
"odorQ", "fruity", "flower", "spice",
"plante", "phenolic", "aromaI", "aromaP",
"aromaQ","attackI", "acidity", "astringency",
"alcohol", "balance", "smooth", "bitterness",
"intensity", "harmony", "overall", "typical")
wine2 <- wine %>% rename_with(function(x) newColnames)
colnames(wine2)## [1] "label" "soil" "odorBf" "aromaBf" "fruityBf"
## [6] "flowerBf" "spiceBf" "visual" "nuance" "surface"
## [11] "odorI" "odorQ" "fruity" "flower" "spice"
## [16] "plante" "phenolic" "aromaI" "aromaP" "aromaQ"
## [21] "attackI" "acidity" "astringency" "alcohol" "balance"
## [26] "smooth" "bitterness" "intensity" "harmony" "overall"
## [31] "typical"
wine2 <- wine2 %>% rownames_to_column() %>%
mutate(rowname = paste0("W",sprintf("%0.2d",1:21)))
wine2$rowname## [1] "W01" "W02" "W03" "W04" "W05" "W06" "W07" "W08" "W09" "W10" "W11" "W12"
## [13] "W13" "W14" "W15" "W16" "W17" "W18" "W19" "W20" "W21"
## rowname harmony overall typical newCol
## 1 W01 3.143 3.393 3.250 1
## 2 W02 2.964 3.214 3.036 2
## 3 W03 3.143 3.536 3.179 3
## 4 W04 2.038 2.464 2.250 4
## 5 W05 3.643 3.741 3.444 5
## 6 W06 3.500 3.643 3.393 6
## rowname harmony overall typical newCol
## 17 W17 3.250 3.536 3.269 17
## 18 W18 3.074 3.464 3.444 18
## 19 W19 2.107 2.370 2.321 19
## 20 W20 2.741 2.643 2.571 20
## 21 W21 3.000 2.852 2.750 21
## 22 W21 3.000 2.852 2.750 21
## rowname label soil overall
## 1 W05 Saumur Reference 3.741
## 2 W06 Bourgueuil Reference 3.643
## 3 W07 Bourgueuil Reference 3.714
## 4 W14 Saumur Reference 3.929
## 5 W15 Bourgueuil Env1 3.643
## 6 W16 Bourgueuil Reference 3.750
wine2 %>% rowwise() %>% mutate(worse=min(c_across(4:30))) %>%
filter(worse<1.5) %>% select(rowname, worse)## # A tibble: 8 x 2
## # Rowwise:
## rowname worse
## <chr> <dbl>
## 1 W02 1.38
## 2 W03 1.25
## 3 W04 1.48
## 4 W06 1.48
## 5 W07 1.48
## 6 W16 1.4
## 7 W18 1.43
## 8 W19 1.32
## 'data.frame': 22 obs. of 1 variable:
## $ overall: num 3.39 3.21 3.54 2.46 3.74 ...
## num [1:22] 3.39 3.21 3.54 2.46 3.74 ...
## harmony overall typical
## 17 3.250 3.536 3.269
## 18 3.074 3.464 3.444
## 19 2.107 2.370 2.321
## 20 2.741 2.643 2.571
## 21 3.000 2.852 2.750
## 22 3.000 2.852 2.750
## [1] "W01" "W02" "W04" "W08" "W09" "W10" "W12" "W18" "W19" "W20" "W21" "W21"
qualityLevel <- data.frame(level=1:4,
qualityNom = c("Horrendous","Bad", "Good", "Excellent"))
wine2 <- wine2 %>% mutate(level=round(overall,0)) %>%
inner_join(qualityLevel)## Joining, by = "level"
## harmony overall typical level qualityNom
## 1 3.143 3.393 3.250 3 Good
## 2 2.964 3.214 3.036 3 Good
## 3 3.143 3.536 3.179 4 Excellent
## 4 2.038 2.464 2.250 2 Bad
## 5 3.643 3.741 3.444 4 Excellent
## 6 3.500 3.643 3.393 4 Excellent
The same information can be presented in many different ways, even in a data table. These are commonly referred as formats o shapes.
Having tidy data is critical to facilitate data preparation, analysis and visualization.
Data is tidy when structured in data table where every row refers to an individual, an observational unit, or an observation; and each column refers to a variable or property of the observational unit. Data is untidy if not rectangular, if different observational types are in the same data table, if different rows refer to the same observation or observational unit, if a column contains more than one type of measurement, if values of variable are in the column headers…
Wickham, H. (2014). Tidy data. Journal of statistical software, 59(1), 1-23.
When we have many variables that correspond to the same type of measurement, a data frame can be organized in two formats: a long format and a wide format.
In wide format, rows are observational units and there are different columns for the different variables being measured.
In long format, rows are observations and there is a column for the values and a column for the variable being measured.
Different analyses and visualizations will require having the data organized in one or other format.
Please be aware that tidy data can be in wide or long format, and that data in long or wide formats can still be untidy.
The data set wine is available in a wide format. Each row is a type of wine.
Let’s convert it to long…
## tibble [609 x 5] (S3: tbl_df/tbl/data.frame)
## $ rowname : chr [1:609] "2EL " "2EL " "2EL " "2EL " ...
## $ Label : Factor w/ 3 levels "Saumur","Bourgueuil",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Soil : Factor w/ 4 levels "Reference","Env1",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ determination: chr [1:609] "Odor.Intensity.before.shaking" "Aroma.quality.before.shaking" "Fruity.before.shaking" "Flower.before.shaking" ...
## $ value : num [1:609] 3.07 3 2.71 2.28 1.96 ...
## # A tibble: 6 x 5
## rowname Label Soil determination value
## <chr> <fct> <fct> <chr> <dbl>
## 1 "2EL " Saumur Env1 Odor.Intensity.before.shaking 3.07
## 2 "2EL " Saumur Env1 Aroma.quality.before.shaking 3
## 3 "2EL " Saumur Env1 Fruity.before.shaking 2.71
## 4 "2EL " Saumur Env1 Flower.before.shaking 2.28
## 5 "2EL " Saumur Env1 Spice.before.shaking 1.96
## 6 "2EL " Saumur Env1 Visual.intensity 4.32
Let’s go back to wide…
## tibble [21 x 32] (S3: tbl_df/tbl/data.frame)
## $ rowname : chr [1:21] "2EL " "1CHA" "1FON" "1VAU" ...
## $ Label : Factor w/ 3 levels "Saumur","Bourgueuil",..: 1 1 2 3 1 2 2 1 3 1 ...
## $ Soil : Factor w/ 4 levels "Reference","Env1",..: 2 2 2 3 1 1 1 2 2 3 ...
## $ Odor.Intensity.before.shaking: num [1:21] 3.07 2.96 2.86 2.81 3.61 ...
## $ Aroma.quality.before.shaking : num [1:21] 3 2.82 2.93 2.59 3.43 ...
## $ Fruity.before.shaking : num [1:21] 2.71 2.38 2.56 2.42 3.15 ...
## $ Flower.before.shaking : num [1:21] 2.28 2.28 1.96 1.91 2.15 ...
## $ Spice.before.shaking : num [1:21] 1.96 1.68 2.08 2.16 2.04 ...
## $ Visual.intensity : num [1:21] 4.32 3.22 3.54 2.89 4.39 ...
## $ Nuance : num [1:21] 4 3 3.39 2.79 4.04 ...
## $ Surface.feeling : num [1:21] 3.27 2.81 3 2.54 3.38 ...
## $ Odor.Intensity : num [1:21] 3.41 3.37 3.25 3.16 3.54 ...
## $ Quality.of.odour : num [1:21] 3.31 3 2.93 2.88 3.36 ...
## $ Fruity : num [1:21] 2.88 2.56 2.77 2.39 3.16 ...
## $ Flower : num [1:21] 2.32 2.44 2.19 2.08 2.23 ...
## $ Spice : num [1:21] 1.84 1.74 2.25 2.17 2.15 ...
## $ Plante : num [1:21] 2 2 1.75 2.3 1.76 ...
## $ Phenolic : num [1:21] 1.65 1.38 1.25 1.48 1.6 ...
## $ Aroma.intensity : num [1:21] 3.26 2.96 3.08 2.54 3.62 ...
## $ Aroma.persistency : num [1:21] 2.96 2.81 2.8 2.58 3.3 ...
## $ Aroma.quality : num [1:21] 3.2 2.93 3.08 2.48 3.46 ...
## $ Attack.intensity : num [1:21] 2.96 3.04 3.22 2.7 3.46 ...
## $ Acidity : num [1:21] 2.11 2.11 2.18 3.18 2.57 ...
## $ Astringency : num [1:21] 2.43 2.18 2.25 2.18 2.54 ...
## $ Alcohol : num [1:21] 2.5 2.65 2.64 2.5 2.79 ...
## $ Balance : num [1:21] 3.25 2.93 3.32 2.33 3.46 ...
## $ Smooth : num [1:21] 2.73 2.5 2.68 1.68 3.04 ...
## $ Bitterness : num [1:21] 1.93 1.93 2 1.96 2.07 ...
## $ Intensity : num [1:21] 2.86 2.89 3.07 2.46 3.64 ...
## $ Harmony : num [1:21] 3.14 2.96 3.14 2.04 3.64 ...
## $ Overall.quality : num [1:21] 3.39 3.21 3.54 2.46 3.74 ...
## $ Typical : num [1:21] 3.25 3.04 3.18 2.25 3.44 ...
## # A tibble: 2 x 32
## rowname Label Soil Odor.Intensity.bef~ Aroma.quality.bef~ Fruity.before.sh~
## <chr> <fct> <fct> <dbl> <dbl> <dbl>
## 1 "2EL " Saumur Env1 3.07 3 2.71
## 2 "1CHA" Saumur Env1 2.96 2.82 2.38
## # ... with 26 more variables: Flower.before.shaking <dbl>,
## # Spice.before.shaking <dbl>, Visual.intensity <dbl>, Nuance <dbl>,
## # Surface.feeling <dbl>, Odor.Intensity <dbl>, Quality.of.odour <dbl>,
## # Fruity <dbl>, Flower <dbl>, Spice <dbl>, Plante <dbl>, Phenolic <dbl>,
## # Aroma.intensity <dbl>, Aroma.persistency <dbl>, Aroma.quality <dbl>,
## # Attack.intensity <dbl>, Acidity <dbl>, Astringency <dbl>, Alcohol <dbl>,
## # Balance <dbl>, Smooth <dbl>, Bitterness <dbl>, Intensity <dbl>, ...
## [,1]
## Odor.Intensity.before.shaking_mean 3.1111429
## Odor.Intensity.before.shaking_sd 0.2894526
## Aroma.quality.before.shaking_mean 3.0457619
## Aroma.quality.before.shaking_sd 0.2056397
## Fruity.before.shaking_mean 2.7140476
## Fruity.before.shaking_sd 0.2010245
## Flower.before.shaking_mean 2.0570000
## Flower.before.shaking_sd 0.1435949
## Spice.before.shaking_mean 1.9914762
## Spice.before.shaking_sd 0.2396620
## Visual.intensity_mean 3.9575238
## Visual.intensity_sd 0.5494639
## Nuance_mean 3.7273810
## Nuance_sd 0.5207391
## Surface.feeling_mean 3.1802857
## Surface.feeling_sd 0.2798259
## Odor.Intensity_mean 3.3952381
## Odor.Intensity_sd 0.2004894
## Quality.of.odour_mean 3.2373333
## Quality.of.odour_sd 0.2231021
## Fruity_mean 2.8471429
## Fruity_sd 0.2215963
## Flower_mean 2.1621429
## Flower_sd 0.1507466
## Spice_mean 2.1783333
## Spice_sd 0.2063411
## Plante_mean 1.9644286
## Plante_sd 0.1634933
## Phenolic_mean 1.5423810
## Phenolic_sd 0.1525924
## Aroma.intensity_mean 3.1969524
## Aroma.intensity_sd 0.2548253
## Aroma.persistency_mean 2.9827619
## Aroma.persistency_sd 0.2454870
## Aroma.quality_mean 3.0636190
## Aroma.quality_sd 0.3159952
## Attack.intensity_mean 3.1562857
## Attack.intensity_sd 0.3060454
## Acidity_mean 2.3852857
## Acidity_sd 0.2398940
## Astringency_mean 2.4402857
## Astringency_sd 0.1967954
## Alcohol_mean 2.7406667
## Alcohol_sd 0.1673895
## Balance_mean 3.1290476
## Balance_sd 0.3326547
## Smooth_mean 2.6742381
## Smooth_sd 0.4139867
## Bitterness_mean 2.0679048
## Bitterness_sd 0.1864001
## Intensity_mean 3.1659524
## Intensity_sd 0.3725651
## Harmony_mean 3.1479524
## Harmony_sd 0.4420390
## Overall.quality_mean 3.3311429
## Overall.quality_sd 0.4320131
## Typical_mean 3.1967619
## Typical_sd 0.4257324
## # A tibble: 29 x 3
## determination mean sd
## <chr> <dbl> <dbl>
## 1 Acidity 2.39 0.240
## 2 Alcohol 2.74 0.167
## 3 Aroma.intensity 3.20 0.255
## 4 Aroma.persistency 2.98 0.245
## 5 Aroma.quality 3.06 0.316
## 6 Aroma.quality.before.shaking 3.05 0.206
## 7 Astringency 2.44 0.197
## 8 Attack.intensity 3.16 0.306
## 9 Balance 3.13 0.333
## 10 Bitterness 2.07 0.186
## # ... with 19 more rows
The grammar of graphics is a theoretical approximation to the study of the graphics components. According to this, a graphical representation can be built from layers, by defining different elements
Additional elements of the representation can also be adjusted like scales, formats, annotations…, after the basic structure of the graphics is established.
Wilkinson, L. (2006). The grammar of graphics. Springer Science & Business Media.
ggplot2ggplot2 is an R package which implements the grammar of graphics. A full reference can be found at http://ggplot2.tidyverse.org.
A current cheatsheet is available at https://www.rstudio.com/resources/cheatsheets/.
Wickham, H. (2010). A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1), 3-28.
ggplot2 is part of tidyverse package (although it can be installed and loaded independently).
ggplot2 – Layers and Graphical ElementsIn ggplot2, each graphical depiction of a data set is a layer. One or more layers make a graph.
Each layer is defined by the specification of its elements. The main ones are the data, the aesthetic mapping –the relation between the data and the graphical elements–, and the geometry.
In ggplot2, a plot is an R object and it can be stored. It is constructed additively, by summing different elements to the gg object constructed by the ggplot2 function.
For example,
ggplot2 – DataIn ggplot2, the data is the first argument of the ggplot function. It is usually a data frame –or an object that can be converted to one–.
ggplot2 – Aesthetic mappingThe aesthetic mapping establishes the relationships between data variables and graphical elements. It is the second argument of the ggplot function and it must be created from the aes support function.
The most adequate mappings will depend on the type of variable and the type of plot –geometry–.
For quantitative variables, the most usual mappings are positions –x, y…–, size, and colors –color, fill–.
For qualitative variables, we opt for positions –x, y…–, colors –color, fill–, and shape as the mappings of preference.
ggplot2 – GeometryThe geometries (geom_...) indicate the format –type– of the plot, meaning how the graphical variables are put together. One or more geometries (each will be a layer) can be included in a plot by adding them to the gg object.
The data and the aesthetic mapping are inherited from the gg object but can be overrode if defined as arguments in the geom_... function. CAUTION: the order of the arguments in the geom_... is reversed from the order in the ggplot function; aesthetic mapping precedes the data.
Common geometries are
geom_pointgeom_line, geom_vline, geom_hlinegeom_bargeom_histogramgeom_boxplotFor each geometry, it is important to check what are the required and available mappings. This is available in the help pages and in the cheatsheet available at https://www.rstudio.com/resources/cheatsheets/.
ggplot2 GraphsWe will discuss here how to create the most common graphs in ggplot2. We will be adding specific considerations where required. We will explore
By default, the plots in ggplot2 are constructed from a data frame from which the variables to be represented are selected.
We will use a sample of 1000 points from the diamonds data set –included in the ggplot2 package– to create the examples.
## tibble [1,000 x 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:1000] 0.78 0.51 0.91 1.24 1.04 0.23 0.71 0.4 1.54 0.93 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 3 4 5 3 1 3 5 5 4 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 4 6 2 4 3 3 5 2 1 7 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 4 3 3 2 1 7 4 4 4 4 ...
## $ depth : num [1:1000] 61.3 62.7 62.3 63.3 65.4 60.1 62.1 62.3 59.3 63.3 ...
## $ table : num [1:1000] 60 59 55 57 60 57 59 56 59 61 ...
## $ price : int [1:1000] 3012 984 5431 4462 2134 465 2572 912 16736 3208 ...
## $ x : num [1:1000] 5.89 5.06 6.2 6.85 6.39 4 5.73 4.72 7.54 6.19 ...
## $ y : num [1:1000] 5.96 5.09 6.23 6.79 6.3 4.02 5.68 4.75 7.61 6.14 ...
## $ z : num [1:1000] 3.63 3.18 3.87 4.32 4.15 2.41 3.55 2.95 4.49 3.9 ...
ggplot2 Graphs – Scatter PlotAdding a third variable and adjusting formats…
With a trend line…
## `geom_smooth()` using formula 'y ~ x'
ggplot2 Graphs – HistogramAdding the cut variable…
In relative frequencies, per group…
Let’s try with a density plot…
ggplot2 Graphs – Bar PlotAdding the cut variable…
To compare absolute frequencies, dodged bars work better…
To compare cumulative relative frequencies, per-class stacked bars are more useful.
Bar plots can also be constructed from frequency counts. In this case, the y variable has to be stated and the argument stat="identity" has to be added to the geom_bar.
ggplot2 Graphs – BoxplotAdding the cut variable…
The chart can be improved by showing all the data points with the position modified with some random noise –jittered–.
Or including a violin plot, geom_violin, and an addtional point with the the mean.
ggplot2 – Additional ElementsBeyond the already presented elements –data, aesthetic mapping, and geometries–, other ggplot2 elements help adjust many other aspects of the plots, for example
scale_... controls aspects related to the presentation of the graphical variables,coord_... adjustes the coordinate system used in the plot,labs sets plot and axes titles,theme, theme_... control the non-data ink elements of the plot, andfacet_ facilitates the creation of multiple plots.One last example…
## Warning: Using shapes for an ordinal variable is not advised
ggplot(diaM, aes(x=carat, y = price,
shape = cut, col = clarity)) +
geom_point(alpha=.6) +
scale_x_continuous(breaks=1:3) +
scale_y_continuous(trans="log10") +
scale_color_brewer(palette="Spectral")+
facet_grid(cut ~ clarity) +
theme_bw() +
theme(legend.position = "none",
text = element_text(size=10))## Warning: Using shapes for an ordinal variable is not advised
OrchardSprays data set and make two plots that highlight the conclusions that can be reached from analyzing it.